We will quickly examine one event, with all data included, from the Kaggle TrackML dataset. This is event number 9999, the last event in the whole dataset, chosen because we will run classification on it later and it therefore shouldn't appear in a training set built from earlier events.
event_file = '/global/cscratch1/sd/danieltm/ExaTrkX/trackml/train_all/event000009999'
hits, particles, truth = trackml.dataset.load_event(
    event_file, parts=['hits', 'particles', 'truth'])
We will leave noise hits in for this intro:
pt_min = 0
vlids = [(8,2), (8,4), (8,6), (8,8),
         (13,2), (13,4), (13,6), (13,8),
         (17,2), (17,4)]
n_det_layers = len(vlids)
# Select barrel layers and assign convenient layer number [0-9]
vlid_groups = hits.groupby(['volume_id', 'layer_id'])
hits = pd.concat([vlid_groups.get_group(vlids[i]).assign(layer=i)
                  for i in range(n_det_layers)])
# Calculate PER-HIT particle transverse momentum
pt = np.sqrt(truth.tpx**2 + truth.tpy**2)
truth = truth.assign(pt=pt)
# Calculate derived hits variables
r = np.sqrt(hits.x**2 + hits.y**2)
phi = np.arctan2(hits.y, hits.x)
# Select the data columns we need
hits = (hits[['hit_id', 'x', 'y', 'z', 'layer']]
        .assign(r=r, phi=phi)
        .merge(truth[['hit_id', 'particle_id', 'pt']], on='hit_id'))
We now have a hit list with cylindrical coordinates attached, and...
len(np.unique(hits.particle_id))
unique tracks, including one "track" of all noise hits with particle ID = 0. We can visualise some of these tracks...
plt.figure(figsize=(10,10))
for pid in np.unique(hits.particle_id)[:20]:
    pid_hits = hits[hits['particle_id'] == pid]
    size = 1 if pid == 0 else 50
    plt.scatter(pid_hits.x, pid_hits.y, s=size)
where the blue points are noise hits, and the colors denote hits from the same track.
Consider the pT distribution:
sns.distplot(hits[hits.pt < 3].pt)
That is, most of the hits are from tracks below 0.5 GeV. This may be misleading, though, because it counts HITS, not tracks. We can instead histogram over tracks:
sns.distplot(hits[hits.pt < 3].groupby('particle_id')['pt'].mean())
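The distinction matters, so here is a tiny toy illustration (synthetic numbers, not TrackML data) of how a hit-weighted distribution over-represents long low-pT tracks relative to a track-weighted one:

```python
import numpy as np
import pandas as pd

# Toy sample: one 10-hit track at pt = 0.2 GeV, one 3-hit track at pt = 2.0 GeV
toy = pd.DataFrame({
    'particle_id': [1]*10 + [2]*3,
    'pt':          [0.2]*10 + [2.0]*3,
})

# Hit-weighted: the low-pt track dominates because it leaves more hits
hit_frac_low = (toy.pt < 0.5).mean()          # 10/13 of hits are low-pt

# Track-weighted: each track counts once
track_pt = toy.groupby('particle_id')['pt'].mean()
track_frac_low = (track_pt < 0.5).mean()      # only 1/2 of tracks are low-pt

print(hit_frac_low, track_frac_low)
```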
low_pt_hits = hits[hits.pt < 0.3]
plt.figure(figsize=(10,10))
for pid in np.unique(low_pt_hits.particle_id)[10:20]:
    pid_hits = hits[hits['particle_id'] == pid]
    size = 1 if pid == 0 else 50
    plt.scatter(pid_hits.x, pid_hits.y, s=size)
To make our job as hard as possible, let's try to classify a track that is low pT, AND has duplicate hits on a layer.
layer_count = low_pt_hits.groupby(['particle_id', 'layer']).count()
duplicate_pids = layer_count[layer_count.hit_id > 1]
unique_duplicate_pids = np.unique([row[0] for row in duplicate_pids.index])
Out of the total number of tracks, there are...
len(unique_duplicate_pids)
low-pT tracks with duplicate hits. These look like...
plt.figure(figsize=(10,10))
for pid in unique_duplicate_pids[-10:]:
    pid_hits = hits[hits['particle_id'] == pid]
    size = 1 if pid == 0 else 50
    plt.scatter(pid_hits.x, pid_hits.y, s=size)
Look at that beautiful red track with multiple duplicates and a huge curvature. It has particle ID #846680715692089346 and event ID #9999, but let's call it Brian. Our task will be to classify all tracks correctly, but focus on Brian as a tough case.
brian = 846680715692089346
We will try multiple methods of constructing a graph from the hit data. For these constructions, we will work with no pT cut or duplicate restriction - all real hits are fair game. However, we will remove noise.
# Geometric and physics cuts
pt_min = 0.
phi_slope_max = .001
z0_max = 200
# Graph features and scale
feature_names = ['r', 'phi', 'z']
n_phi_sections = 8
feature_scale = np.array([1000., np.pi / n_phi_sections, 1000.])
We load the TrackML data and run the constructor to get training and test sets: training events come from the front of the event list, test events from the back. For visualisation purposes, we will just look at one test example, event #9999.
def calc_eta(r, z):
    theta = np.arctan2(r, z)
    return -1. * np.log(np.tan(theta / 2.))
def calc_dphi(phi1, phi2):
    """Computes phi2 - phi1, wrapped into the range [-pi, pi]"""
    dphi = phi2 - phi1
    dphi[dphi > np.pi] -= 2*np.pi
    dphi[dphi < -np.pi] += 2*np.pi
    return dphi
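A quick sanity check of the wraparound behaviour, on a pair of angles straddling the phi = ±pi boundary:

```python
import numpy as np

def calc_dphi(phi1, phi2):
    """Computes phi2 - phi1, wrapped into the range [-pi, pi]"""
    dphi = phi2 - phi1
    dphi[dphi > np.pi] -= 2*np.pi
    dphi[dphi < -np.pi] += 2*np.pi
    return dphi

# Two hits on opposite sides of the phi = +/-pi boundary are really only
# 0.2 rad apart, not ~2*pi apart
d = calc_dphi(np.array([np.pi - 0.1]), np.array([-np.pi + 0.1]))
print(d)  # ~0.2
```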
def split_detector_sections(hits, phi_edges, eta_edges):
    """Split hits according to provided phi and eta boundaries."""
    hits_sections = []
    # Loop over sections
    for i in range(len(phi_edges) - 1):
        phi_min, phi_max = phi_edges[i], phi_edges[i+1]
        # Select hits in this phi section
        phi_hits = hits[(hits.phi > phi_min) & (hits.phi < phi_max)]
        # Center these hits on phi=0
        centered_phi = phi_hits.phi - (phi_min + phi_max) / 2
        phi_hits = phi_hits.assign(phi=centered_phi, phi_section=i)
        for j in range(len(eta_edges) - 1):
            eta_min, eta_max = eta_edges[j], eta_edges[j+1]
            # Select hits in this eta section
            eta = calc_eta(phi_hits.r, phi_hits.z)
            sec_hits = phi_hits[(eta > eta_min) & (eta < eta_max)]
            hits_sections.append(sec_hits.assign(eta_section=j))
    return hits_sections
def select_hits(hits, truth, particles, pt_min=0):
    # Barrel volume and layer ids
    vlids = [(8,2), (8,4), (8,6), (8,8),
             (13,2), (13,4), (13,6), (13,8),
             (17,2), (17,4)]
    n_det_layers = len(vlids)
    # Select barrel layers and assign convenient layer number [0-9]
    vlid_groups = hits.groupby(['volume_id', 'layer_id'])
    hits = pd.concat([vlid_groups.get_group(vlids[i]).assign(layer=i)
                      for i in range(n_det_layers)])
    # Calculate PER-HIT particle transverse momentum
    pt = np.sqrt(truth.tpx**2 + truth.tpy**2)
    truth = truth.assign(pt=pt)
    # Calculate derived hits variables
    r = np.sqrt(hits.x**2 + hits.y**2)
    phi = np.arctan2(hits.y, hits.x)
    # Select the data columns we need
    hits = (hits[['hit_id', 'x', 'y', 'z', 'layer']]
            .assign(r=r, phi=phi)
            .merge(truth[['hit_id', 'particle_id', 'pt']], on='hit_id'))
    # Remove noise
    hits = hits[hits.particle_id != 0]
    return hits
def select_segments(hits1, hits2, phi_slope_max, z0_max):
    """
    Construct a list of selected segments from the pairings
    between hits1 and hits2, filtered with the specified
    phi slope and z0 criteria.

    Returns: pd DataFrame of (index_1, index_2), corresponding to the
    DataFrame hit label-indices in hits1 and hits2, respectively.
    """
    # Start with all possible pairs of hits
    keys = ['evtid', 'r', 'phi', 'z', 'particle_id', 'hit_id']
    hit_pairs = hits1[keys].reset_index().merge(
        hits2[keys].reset_index(), on='evtid', suffixes=('_1', '_2'))
    # Compute the line through each pair of points
    dphi = calc_dphi(hit_pairs.phi_1, hit_pairs.phi_2)
    dz = hit_pairs.z_2 - hit_pairs.z_1
    dr = hit_pairs.r_2 - hit_pairs.r_1
    phi_slope = dphi / dr
    z0 = hit_pairs.z_1 - hit_pairs.r_1 * dz / dr
    # Filter segments according to the criteria
    good_seg_mask = (phi_slope.abs() < phi_slope_max) & (z0.abs() < z0_max)
    return hit_pairs[['index_1', 'index_2']][good_seg_mask]
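To make the two cuts concrete, here is a single toy doublet (made-up coordinates, in the same mm/radian units as the data) pushed through the same phi-slope and z0 formulas, with the cut values used above:

```python
# Toy doublet: hit 1 on an inner layer at r=32 mm, hit 2 on the next layer
# at r=72 mm (coordinates invented for illustration)
r1, phi1, z1 = 32.0, 0.0, 10.0
r2, phi2, z2 = 72.0, 0.02, 25.0

dphi = phi2 - phi1            # already within [-pi, pi] here
dr = r2 - r1
dz = z2 - z1

phi_slope = dphi / dr         # 0.02 / 40 = 5e-4
z0 = z1 - r1 * dz / dr        # extrapolated z at r=0: 10 - 32*15/40 = -2.0

# Same cuts as the notebook's phi_slope_max and z0_max
passes = (abs(phi_slope) < 0.001) and (abs(z0) < 200)
print(phi_slope, z0, passes)
```

A straight, prompt doublet like this survives both cuts; a large phi slope (strong curvature) or a large extrapolated z-intercept (displaced origin) would remove it.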
def construct_graph(hits, layer_pairs,
                    phi_slope_max, z0_max,
                    feature_names,
                    feature_scale):
    layer_groups = hits.groupby('layer')
    segments = []
    for (layer1, layer2) in layer_pairs:
        # Find and join all hit pairs
        try:
            hits1 = layer_groups.get_group(layer1)
            hits2 = layer_groups.get_group(layer2)
        # If an event has no hits on a layer, we get a KeyError.
        # In that case we just skip to the next layer pair
        except KeyError as e:
            logging.info('skipping empty layer: %s' % e)
            continue
        # Construct the segments
        segments.append(select_segments(hits1, hits2, phi_slope_max, z0_max))
    # Combine segments from all layer pairs
    segments = pd.concat(segments)
    X = (hits[feature_names].values / feature_scale).astype(np.float32)
    pid = hits['particle_id'].values.astype(np.int64)
    I = hits['hit_id'].values.astype(np.int64)
    n_edges = len(segments)
    n_hits = len(hits)
    # Label an edge true iff both ends belong to the same particle
    pid1 = hits.particle_id.loc[segments.index_1].values
    pid2 = hits.particle_id.loc[segments.index_2].values
    y = np.zeros(n_edges, dtype=np.float32)
    y[:] = (pid1 == pid2)
    # Map DataFrame label-indices to positional indices for the edge array
    hit_idx = pd.Series(np.arange(n_hits), index=hits.index)
    seg_start = hit_idx.loc[segments.index_1].values
    seg_end = hit_idx.loc[segments.index_2].values
    e = np.vstack([seg_start, seg_end])
    data = Data(x=torch.from_numpy(X).float(),
                edge_index=torch.from_numpy(e),
                y=torch.from_numpy(y),
                I=torch.from_numpy(I),
                pid=torch.from_numpy(pid))
    return data
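The edge-labelling step in construct_graph is worth isolating: an edge is true exactly when both endpoints carry the same particle_id. A minimal sketch with made-up ids:

```python
import numpy as np

# Four toy hits: hits 0 and 1 belong to particle 5, hit 2 to particle 6,
# hit 3 is noise (pid 0)
pid = np.array([5, 5, 6, 0])

# Three candidate edges: (0-1), (1-2), (2-3)
seg_start = np.array([0, 1, 2])
seg_end = np.array([1, 2, 3])

# True label iff both ends share a particle id
y = (pid[seg_start] == pid[seg_end]).astype(np.float32)
print(y)  # [1. 0. 0.]
```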
def build_event(event_file, pt_min, phi_slope_max, z0_max, feature_names, feature_scale,
                n_phi_sections=1, n_eta_sections=1):
    hits, particles, truth = trackml.dataset.load_event(
        event_file, parts=['hits', 'particles', 'truth'])
    hits = select_hits(hits, truth, particles, pt_min=pt_min).assign(evtid=int(event_file[-9:]))
    phi_range, eta_range = [-np.pi, np.pi], [-5, 5]
    phi_edges = np.linspace(*phi_range, num=n_phi_sections+1)
    eta_edges = np.linspace(*eta_range, num=n_eta_sections+1)
    hits_sections = split_detector_sections(hits, phi_edges, eta_edges)
    # Define adjacent layer pairs (0-1, 1-2, ..., 8-9)
    n_det_layers = 10
    l = np.arange(n_det_layers)
    layer_pairs = np.stack([l[:-1], l[1:]], axis=1)
    graphs_all = [construct_graph(section_hits, layer_pairs=layer_pairs,
                                  phi_slope_max=phi_slope_max, z0_max=z0_max,
                                  feature_names=feature_names,
                                  feature_scale=feature_scale)
                  for section_hits in hits_sections]
    return graphs_all
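For reference, the phi edges that build_event produces with our n_phi_sections = 8 are just equal bins over [-pi, pi]:

```python
import numpy as np

# The same linspace call build_event makes for 8 phi sections
n_phi_sections = 8
phi_edges = np.linspace(-np.pi, np.pi, num=n_phi_sections + 1)

# Eight equal bins of width pi/4 (nine edges)
print(np.round(np.diff(phi_edges), 4))
```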
def prepare_event(event_file, pt_min, phi_slope_max, z0_max, feature_names, feature_scale,
                  n_phi_sections=1, iter=None, num_samples=None, out=None):
    graphs_all = build_event(event_file, pt_min, phi_slope_max, z0_max,
                             feature_names, feature_scale, n_phi_sections)
    if iter is not None and num_samples is not None:
        out.update(progress(iter, num_samples))
    return graphs_all
HC_input_dir = "/global/cscratch1/sd/danieltm/ExaTrkX/trackml/train_all/"
all_events = os.listdir(HC_input_dir)
all_events = [HC_input_dir + event[:14] for event in all_events]
all_events = list(set(all_events))
all_events.sort()
train_size, test_size = 1, 1
out = display(progress(0, train_size), display_id=True)
HC_train_dataset = [prepare_event(event_file, pt_min, phi_slope_max, z0_max, feature_names, feature_scale, n_phi_sections, iter, train_size, out) for (event_file, iter) in zip(all_events[:train_size], range(train_size))]
HC_train_dataset = [datapoint for dataset in HC_train_dataset for datapoint in dataset]
out = display(progress(0, test_size), display_id=True)
HC_test_dataset = [prepare_event(event_file, pt_min, phi_slope_max, z0_max, feature_names, feature_scale, n_phi_sections, iter, test_size, out) for (event_file, iter) in zip(all_events[-test_size:], range(test_size))]
HC_test_dataset = [datapoint for dataset in HC_test_dataset for datapoint in dataset]
HC_train_loader = DataLoader(HC_train_dataset, batch_size=2, shuffle=True)
HC_test_loader = DataLoader(HC_test_dataset, batch_size=2, shuffle=True)
Finding Brian:
g1 = HC_test_loader.dataset[7]
g2 = HC_test_loader.dataset[6]
X = g1.x.numpy() * feature_scale
hits = np.vstack([X.T, g1.I.numpy(), g1.pid.numpy()]).T
sum(hits[:,4] == brian)
np.where(np.unique(hits[:,4])==brian)
plt.figure(figsize=(10,10))
for pid in np.unique(hits[:,4])[1170:1180]:
    pid_hits = hits[hits[:,4] == pid]
    x = pid_hits[:,0] * np.cos(pid_hits[:,1])
    y = pid_hits[:,0] * np.sin(pid_hits[:,1])
    plt.scatter(x, y, s=50)
Our Brian is the grey track above.
def plot_brian(event, feature_scale, brian, phi_min=-np.pi, phi_max=np.pi, r_min=0, r_max=1200):
    X = event.x.numpy() * feature_scale
    X_filter = (X[:,1] > phi_min) & (X[:,1] < phi_max) & (X[:,0] >= r_min) & (X[:,0] <= r_max)
    brian_filter = event.pid.numpy() == brian
    x = X[:,0] * np.cos(X[:,1])
    y = X[:,0] * np.sin(X[:,1])
    plt.figure(figsize=(10,10))
    e = event.edge_index.numpy()
    brian_edges = e[:,(X_filter[e[0,:]]) & (X_filter[e[1,:]]) & (brian_filter[e[0,:]]) & (brian_filter[e[1,:]])]
    print(brian_edges)
    e = e[:,(X_filter[e[0,:]]) & (X_filter[e[1,:]])]
    X_ind = np.arange(len(X))
    e_filter = (np.isin(X_ind, e[0,:])) | (np.isin(X_ind, e[1,:]))
    plt.plot([x[e[0,:]], x[e[1,:]]], [y[e[0,:]], y[e[1,:]]], c='b', alpha=0.01)
    plt.plot([x[brian_edges[0,:]], x[brian_edges[1,:]]], [y[brian_edges[0,:]], y[brian_edges[1,:]]], c='r', alpha=0.9)
    plt.scatter(x[X_filter & e_filter & ~brian_filter], y[X_filter & e_filter & ~brian_filter], c='k', alpha=0.01)
    plt.scatter(x[X_filter & brian_filter], y[X_filter & brian_filter], c='r', s=50, alpha=0.9)
%%time
plot_brian(g1, feature_scale, brian, r_min = 80, r_max=650)
%%time
plot_brian(g2, feature_scale, brian, r_min = 0, r_max=1400)
Clearly, the heuristic construction didn't nail Brian. Even before classification, he is only half constructed, split between graphs and with some hits too far from each other to connect with doublets.
def reset_hit_indices(X, e, I, pid):
    # Keep only hits that appear in at least one edge, and remap the
    # edge indices accordingly
    U = list(set(e[0,:]) | set(e[1,:]))
    newX = X[U]
    newI = I[U]
    newpid = pid[U]
    Reverse_U = np.zeros(len(X), dtype=int)
    Reverse_U[U] = np.arange(len(U))
    newE = np.zeros((2, e.shape[1]), dtype=int)
    newE[0,:] = Reverse_U[e[0,:]]
    newE[1,:] = Reverse_U[e[1,:]]
    return newX, newE, newI, newpid
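To see what reset_hit_indices does, here is a condensed equivalent (using np.unique for a deterministic ordering) run on a 5-hit toy graph where only three hits touch an edge:

```python
import numpy as np

def reindex(X, e, I, pid):
    # Condensed equivalent of reset_hit_indices above
    U = np.unique(e)                       # hits that appear in any edge
    reverse = np.zeros(len(X), dtype=int)  # old index -> new index
    reverse[U] = np.arange(len(U))
    return X[U], reverse[e], I[U], pid[U]

X = np.arange(10, dtype=np.float32).reshape(5, 2)
I = np.arange(100, 105)
pid = np.array([7, 7, 8, 8, 9])
e = np.array([[0, 2],
              [2, 4]])    # only hits 0, 2 and 4 are used

newX, newE, newI, newpid = reindex(X, e, I, pid)
print(newE)   # [[0 1], [1 2]]
print(newI)   # [100 102 104]
```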
def augment_graph(e, y):
    # Hits with fewer than 4 attached edges each get one extra random
    # (false-labelled) edge to another low-degree hit
    low_edge_count = [k for k, v in Counter(np.hstack([e[0,:], e[1,:]])).items() if v < 4]
    new_edges = []
    for i, hit in enumerate(low_edge_count):
        random_edge = random.choice(low_edge_count[:i] + low_edge_count[i+1:])
        new_edges.append([hit, random_edge])
    new_edges = np.array(new_edges).T
    e = np.hstack((e, new_edges))
    y = np.hstack([y, np.zeros(new_edges.shape[1])])
    return e, y
def construct_AE_graph(X, e, y, I, pid, norm_phi_min, delta, n_phi_sections, augmented):
    """Construct one graph (e.g. from one event), keeping only the edges
    whose hits fall within [norm_phi_min, norm_phi_min + delta]."""
    # Select the edges in this phi segment, leaving the full hit list
    seg_hits = (X[:,1] >= norm_phi_min) & (X[:,1] < (norm_phi_min + delta))
    seg_edges = seg_hits[e[1,:]]  # Whether to filter by the in or out end may impact training; explore this further!
    e_sec = e[:, seg_edges]
    y_sec = y[seg_edges]
    # Prepare the feature data types
    y_sec = y_sec.astype(np.float32)
    # Option to augment graph with random connections
    if augmented:
        e_sec, y_sec = augment_graph(e_sec, y_sec)
    # Reset hit indices to drop unused hits
    X, e_sec, I, pid = reset_hit_indices(X, e_sec, I, pid)
    # Center phi at 0
    X[:,1] = X[:,1] - norm_phi_min - (delta/2)
    # Handle the case of crossing the phi boundary
    X[X[:,1] < (-n_phi_sections), 1] += 2*n_phi_sections
    X[X[:,1] > (n_phi_sections), 1] -= 2*n_phi_sections
    graph = Data(x=torch.from_numpy(X).float(), edge_index=torch.from_numpy(e_sec),
                 y=torch.from_numpy(y_sec), I=torch.from_numpy(I),
                 pid=torch.from_numpy(pid))
    return graph
def prepare_AE_event(event, feature_names, feature_scale, pt_min, n_phi_sections=1,
                     iter=None, num_samples=None, out=None, augmented=False):
    event = np.load(event, allow_pickle=True)
    hits, truth, e, scores = (event['hits'], event['truth'].reset_index(drop=True),
                              event['neighbors'], event['scores'])
    # Calculate derived hits variables
    r = np.sqrt(hits.x**2 + hits.y**2)
    phi = np.arctan2(hits.y, hits.x)
    # Select the data columns we need
    hits = hits.assign(r=r, phi=phi)
    I = hits['hit_id'].to_numpy().astype(np.float32)
    X = (hits[feature_names].values / feature_scale).astype(np.float32)
    truth = truth.to_numpy()
    pt_mask = truth[:,9] > pt_min
    e = e[:,pt_mask[e[0,:]] & pt_mask[e[1,:]]]
    # Orient all edges to point outwards in r, then remove duplicate edges
    r_mask = X[e[0,:],0] > X[e[1,:],0]
    e[0,r_mask], e[1,r_mask] = e[1,r_mask], e[0,r_mask]
    e = np.array(list(set(zip(e[0,:], e[1,:])))).T
    pid = truth[:,1]
    y = (pid[e[0,:]] == pid[e[1,:]]).astype(np.float32)
    data = []
    delta = 2
    for i in range(n_phi_sections):
        norm_phi_min = -n_phi_sections + (i*delta)
        graph = construct_AE_graph(X, e, y, I, pid, norm_phi_min, delta,
                                   n_phi_sections, augmented=augmented)
        data.append(graph)
    if iter is not None and num_samples is not None:
        out.update(progress(iter, num_samples))
    return data
We will load graphs constructed using an Adjacent Embedding (AE) model. That is, the embedding is trained on hits from adjacent layers, but may not necessarily construct edges only between adjacent layers. Let's examine what it gives us.
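As an illustrative sketch only (not the actual AE pipeline), the idea is that hits are mapped into a learned embedding space and edges come from nearest neighbours there, so nothing restricts edges to adjacent layers:

```python
import numpy as np

# Toy stand-in: random vectors play the role of learned hit embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))    # 6 hits, 4-d embedding space

# All pairwise distances, then each hit's nearest *other* hit
d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
np.fill_diagonal(d, np.inf)      # forbid self-matches
nn = d.argmin(axis=1)

# One candidate edge per hit, regardless of which layer either hit sits on
edges = np.vstack([np.arange(len(emb)), nn])
print(edges.shape)               # (2, 6)
```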
AE_input_dir = "/global/cscratch1/sd/danieltm/ExaTrkX/processed_sparse/adjacent/graphs"
all_events = os.listdir(AE_input_dir)
all_events = [os.path.join(AE_input_dir, event) for event in all_events]
all_events.sort()
%%time
train_size, test_size = 1, 1
out = display(progress(0, train_size), display_id=True)
AE_train_dataset = [prepare_AE_event(event, feature_names, feature_scale, pt_min, n_phi_sections, iter, train_size, out) for (iter, event) in enumerate(all_events[:train_size])]
AE_train_dataset = [datapoint for dataset in AE_train_dataset for datapoint in dataset]
out = display(progress(0, test_size), display_id=True)
AE_test_dataset = [prepare_AE_event(event, feature_names, feature_scale, pt_min, n_phi_sections, iter, test_size, out) for (iter, event) in enumerate(all_events[-test_size:])]
AE_test_dataset = [datapoint for dataset in AE_test_dataset for datapoint in dataset]
AE_train_loader = DataLoader(AE_train_dataset, batch_size=2, shuffle=True)
AE_test_loader = DataLoader(AE_test_dataset, batch_size=2, shuffle=True)
g1 = AE_test_loader.dataset[7]
g2 = AE_test_loader.dataset[6]
X = g1.x.numpy() * feature_scale
hits = np.vstack([X.T, g1.I.numpy(), g1.pid.numpy()]).T
sum(hits[:,4] == brian)
np.where(np.unique(hits[:,4])==brian)
Where is Brian:
len(np.unique(hits[:,4]))
plt.figure(figsize=(10,10))
for pid in np.unique(hits[:,4])[1330:1340]:
    pid_hits = hits[hits[:,4] == pid]
    x = pid_hits[:,0] * np.cos(pid_hits[:,1])
    y = pid_hits[:,0] * np.sin(pid_hits[:,1])
    plt.scatter(x, y, s=50)
Our friend should be recognisable as the brown track. The entire track is in this event, which is encouraging - it means that the embedding thought the edges connecting the two furthest hits were likely even though the hits were outside of the naive division of the event. A visualisation will help:
%%time
plot_brian(g1, feature_scale, brian)
Observe that Brian was not connected in this graph, because we (arbitrarily) chose not to keep edges that leave the graph split heading outwards, only those heading inwards. We can revisit this decision later... For now, look at the other graph containing Brian.
%%time
plot_brian(g2, feature_scale, brian)
There is a lot to unpack here. First, there are some self-edges, which the recent push should hopefully remove. There are also doublets between hits on the same layer.
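A hedged sketch of how both defects could be filtered, given an edge list and a per-hit layer array (the layer array is hypothetical here, since the AE graphs only store [r, phi, z] as features):

```python
import numpy as np

# Toy edge list with one self-edge and one same-layer doublet
e = np.array([[0, 1, 2, 3],
              [0, 2, 2, 4]])
layer = np.array([0, 0, 1, 1, 2])          # hypothetical per-hit layer ids
y = np.array([1., 1., 0., 1.], dtype=np.float32)

# Keep an edge only if it is not a self-loop and spans two layers
keep = (e[0] != e[1]) & (layer[e[0]] != layer[e[1]])
e_clean, y_clean = e[:, keep], y[keep]
print(e_clean)   # [[1 3], [2 4]]
```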
HC_data = HC_test_loader.dataset[0]
HC_e, HC_X = HC_data['edge_index'].numpy(), HC_data['x'].numpy()*feature_scale
AE_data = AE_test_loader.dataset[0]
AE_e, AE_X = AE_data['edge_index'].numpy(), AE_data['x'].numpy()*feature_scale
plt.figure(figsize=(10,10))
ax = sns.distplot(list(Counter(np.hstack([HC_e[0,:], HC_e[1,:]])).values()), kde=False, label="Heuristic", bins=np.linspace(0, 25))
ax = sns.distplot(list(Counter(np.hstack([AE_e[0,:], AE_e[1,:]])).values()), kde=False, label="Embedding", bins=np.linspace(0, 25))
# ax = sns.distplot(list(Counter(np.hstack([AAE_e[0,:], AAE_e[1,:]])).values()), kde=False, label="Augmented Embedding", bins=np.linspace(0, 25))
ax.set(ylabel='Count')
ax.legend()
ax.set_title("Distribution of Edges per Hit")
plt.show()
Clearly, the two graphs have different topologies. The heuristic graph is far more "small-world" than the embedded graph: messages can be passed around much more freely.
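One way to quantify that statement is the mean degree (edge endpoints per hit); toy edge lists stand in for HC_e and AE_e here:

```python
import numpy as np

def mean_degree(e, n_hits):
    # Count how many edge endpoints touch each hit, then average
    deg = np.bincount(np.hstack([e[0], e[1]]), minlength=n_hits)
    return deg.mean()

# Toy graph: 4 hits, 5 edges -> 10 endpoints / 4 hits
e_dense = np.array([[0, 0, 1, 1, 2],
                    [1, 2, 2, 3, 3]])
print(mean_degree(e_dense, 4))   # 2.5
```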
Let's try constructing the track by keeping all connecting edges between graphs.
def construct_AE_graph(X, e, y, I, pid, norm_phi_min, delta, n_phi_sections, augmented):
    """Construct one graph (e.g. from one event), keeping only the edges
    whose hits fall within [norm_phi_min, norm_phi_min + delta]."""
    # Select the edges in this phi segment, leaving the full hit list
    seg_hits = (X[:,1] >= norm_phi_min) & (X[:,1] < (norm_phi_min + delta))
    seg_edges = seg_hits[e[1,:]] | seg_hits[e[0,:]]  # Now keep an edge if EITHER end falls in the segment
    e_sec = e[:, seg_edges]
    y_sec = y[seg_edges]
    # Prepare the feature data types
    y_sec = y_sec.astype(np.float32)
    # Option to augment graph with random connections
    if augmented:
        e_sec, y_sec = augment_graph(e_sec, y_sec)
    # Reset hit indices to drop unused hits
    X, e_sec, I, pid = reset_hit_indices(X, e_sec, I, pid)
    # Center phi at 0
    X[:,1] = X[:,1] - norm_phi_min - (delta/2)
    # Handle the case of crossing the phi boundary
    X[X[:,1] < (-n_phi_sections), 1] += 2*n_phi_sections
    X[X[:,1] > (n_phi_sections), 1] -= 2*n_phi_sections
    graph = Data(x=torch.from_numpy(X).float(), edge_index=torch.from_numpy(e_sec),
                 y=torch.from_numpy(y_sec), I=torch.from_numpy(I),
                 pid=torch.from_numpy(pid))
    return graph
%%time
train_size, test_size = 1, 1
out = display(progress(0, train_size), display_id=True)
AE_train_dataset = [prepare_AE_event(event, feature_names, feature_scale, pt_min, n_phi_sections, iter, train_size, out) for (iter, event) in enumerate(all_events[:train_size])]
AE_train_dataset = [datapoint for dataset in AE_train_dataset for datapoint in dataset]
out = display(progress(0, test_size), display_id=True)
AE_test_dataset = [prepare_AE_event(event, feature_names, feature_scale, pt_min, n_phi_sections, iter, test_size, out) for (iter, event) in enumerate(all_events[-test_size:])]
AE_test_dataset = [datapoint for dataset in AE_test_dataset for datapoint in dataset]
AE_train_loader = DataLoader(AE_train_dataset, batch_size=2, shuffle=True)
AE_test_loader = DataLoader(AE_test_dataset, batch_size=2, shuffle=True)
g1 = AE_test_loader.dataset[7]
g2 = AE_test_loader.dataset[6]
X = g1.x.numpy() * feature_scale
hits = np.vstack([X.T, g1.I.numpy(), g1.pid.numpy()]).T
sum(hits[:,4] == brian)
Finding Brian
np.where(np.unique(hits[:,4])==brian)
plt.figure(figsize=(10,10))
for pid in np.unique(hits[:,4])[1460:1470]:
    pid_hits = hits[hits[:,4] == pid]
    x = pid_hits[:,0] * np.cos(pid_hits[:,1])
    y = pid_hits[:,0] * np.sin(pid_hits[:,1])
    plt.scatter(x, y, s=50)
As before, our friend should be recognisable as the brown track, and the entire track is again contained in this event. A visualisation will help:
%%time
plot_brian(g1, feature_scale, brian)
And now Brian is fully covered by this graph. We got a little lucky: given how much curvature the track has, he just fits within one eighth of the detector.
%%time
plot_brian(g2, feature_scale, brian)
HC_data = HC_test_loader.dataset[0]
HC_e, HC_X = HC_data['edge_index'].numpy(), HC_data['x'].numpy()*feature_scale
AE_data = AE_test_loader.dataset[0]
AE_e, AE_X = AE_data['edge_index'].numpy(), AE_data['x'].numpy()*feature_scale
plt.figure(figsize=(10,10))
ax = sns.distplot(list(Counter(np.hstack([HC_e[0,:], HC_e[1,:]])).values()), kde=False, label="Heuristic", bins=np.linspace(0, 25))
ax = sns.distplot(list(Counter(np.hstack([AE_e[0,:], AE_e[1,:]])).values()), kde=False, label="Embedding", bins=np.linspace(0, 25))
# ax = sns.distplot(list(Counter(np.hstack([AAE_e[0,:], AAE_e[1,:]])).values()), kde=False, label="Augmented Embedding", bins=np.linspace(0, 25))
ax.set(ylabel='Count')
ax.legend()
ax.set_title("Distribution of Edges per Hit")
plt.show()
Note that with this choice of split, the edge distribution is shifted to the right. That's good - more edges per hit let the GNN pass messages better.
# Load by directory (preferred)
result_base = os.path.expandvars('$SCRATCH/ExaTrkX/processed_sparse/results/')
result_name = 'high_003'
result_dir = os.path.join(result_base, result_name)
config = load_config_dir(result_dir)
print('Configuration:')
pprint.pprint(config)
summaries = load_summaries(config)
best_idx = summaries.valid_loss.idxmin()
print('\nTraining summaries:')
summaries
# Build the trainer and load best checkpoint
trainer = get_trainer(output_dir=config['output_dir'], gpu=0, **config['trainer'])
trainer.build_model(optimizer_config=config['optimizer'], **config['model'])
best_epoch = summaries.epoch.loc[best_idx]
trainer.load_checkpoint(checkpoint_id=best_epoch)
print(trainer.model)
print('Parameters:', sum(p.numel() for p in trainer.model.parameters()))
# Load the test dataset
n_test = 80
test_loader, filelist = get_test_data_loader(config, n_test=n_test)
%%time
# Apply the model
test_preds, test_targets = trainer.device_predict(test_loader)
i = 0
g = test_loader.dataset[i]
pid = np.load(filelist[g.i][:-4]+"_ID.npz", allow_pickle=True)["pid"]
X, e, y, o = g.x.numpy()*feature_scale, g.edge_index.numpy(), g.y.numpy(), test_preds[i].numpy()
filelist[g.i]
def draw_sample_brian_xy(hits, edges, preds, brian, pid, labels, cut=0.5, figsize=(16, 16)):
    x = hits[:,0] * np.cos(hits[:,1])
    y = hits[:,0] * np.sin(hits[:,1])
    fig, ax0 = plt.subplots(figsize=figsize)
    brian_filter = pid == brian
    p_brian_edges = edges[:,(brian_filter[edges[0,:]]) & (brian_filter[edges[1,:]]) & (preds > cut)]
    n_brian_edges = edges[:,(brian_filter[edges[0,:]]) & (brian_filter[edges[1,:]]) & (preds < cut)]
    # Draw the hits
    ax0.scatter(x, y, s=2, c='k', alpha=0.1)
    # Draw the segments
    for j in range(labels.shape[0]):
        # False negatives
        if preds[j] < cut and labels[j] > cut:
            ax0.plot([x[edges[0,j]], x[edges[1,j]]],
                     [y[edges[0,j]], y[edges[1,j]]],
                     '--', c='b', alpha=0.1)
        # False positives
        if preds[j] > cut and labels[j] < cut:
            ax0.plot([x[edges[0,j]], x[edges[1,j]]],
                     [y[edges[0,j]], y[edges[1,j]]],
                     '-', c='r', alpha=0.1)
        # True positives
        if preds[j] > cut and labels[j] > cut:
            ax0.plot([x[edges[0,j]], x[edges[1,j]]],
                     [y[edges[0,j]], y[edges[1,j]]],
                     '-', c='k', alpha=0.1)
    ax0.plot([x[p_brian_edges[0,:]], x[p_brian_edges[1,:]]], [y[p_brian_edges[0,:]], y[p_brian_edges[1,:]]], c='k', linewidth=3, alpha=0.9)
    ax0.plot([x[n_brian_edges[0,:]], x[n_brian_edges[1,:]]], [y[n_brian_edges[0,:]], y[n_brian_edges[1,:]]], c='r', linewidth=3, alpha=0.9)
    ax0.scatter(x[brian_filter], y[brian_filter], c='k', s=50, alpha=0.9)
    return fig, ax0
%%time
draw_sample_xy(X, e, o, y)
np.where(pid[e[0,o>0.5]]==brian)
np.where(pid[e[1,o>0.5]]==brian)
e_pid = pid[e[0,:]] * y
cut = 0.5
p_edges = e[:,o>cut]
n_edges = e[:,o<cut]
brian_p_edges = p_edges[:,e_pid[o>cut]==brian]
brian_n_edges = n_edges[:,e_pid[o<cut]==brian]
full_brian = np.hstack([brian_n_edges,brian_p_edges])
brian_p_edges
brian_n_edges
plt.figure(figsize=(20,10))
x = X[:,0] * np.cos(X[:,1])
y = X[:,0] * np.sin(X[:,1])
for edge in full_brian.T[:]:
    plt.plot([x[edge[0]], x[edge[1]]], [y[edge[0]], y[edge[1]]], linewidth=1, alpha=0.9)
    plt.scatter(x[edge], y[edge])
plt.figure(figsize=(20,10))
x = X[:,0] * np.cos(X[:,1])
y = X[:,0] * np.sin(X[:,1])
for edge in brian_p_edges.T[:]:
    plt.plot([x[edge[0]], x[edge[1]]], [y[edge[0]], y[edge[1]]], linewidth=1, alpha=0.9)
    plt.scatter(x[edge], y[edge])
plt.figure(figsize=(20,10))
x = X[:,0] * np.cos(X[:,1])
y = X[:,0] * np.sin(X[:,1])
for edge in brian_n_edges.T[:]:
    plt.plot([x[edge[0]], x[edge[1]]], [y[edge[0]], y[edge[1]]], linewidth=1, alpha=0.9)
    plt.scatter(x[edge], y[edge])
plt.figure(figsize=(20,10))
x = X[:,0] * np.cos(X[:,1])
y = X[:,0] * np.sin(X[:,1])
plt.plot([x[brian_p_edges[0,:]], x[brian_p_edges[1,:]]], [y[brian_p_edges[0,:]], y[brian_p_edges[1,:]]], c='k', linewidth=2, alpha=0.9)
plt.plot([x[brian_n_edges[0,:]], x[brian_n_edges[1,:]]], [y[brian_n_edges[0,:]], y[brian_n_edges[1,:]]], c='r', linewidth=2, alpha=0.9)
%%time
draw_sample_brian_xy(X, e, o, brian, pid, y, cut=0.5)
The takeaway: Brian has been classified quite successfully. To be clear, he was not chosen as a track that would be easy to classify; he was chosen a priori as a difficult track, without knowing how well his classification would turn out. He is still missing one segment, however, so we can attempt to classify him with a triplet representation.
We perform a very similar analysis to the doublet case, but use the doublet scores to train the triplet scores.
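For intuition (a sketch, not the actual training code): a triplet is a pair of doublets that share their middle hit, so the triplet graph's nodes are doublets and its edges link doublets (a, b) and (b, c):

```python
import numpy as np

# Toy doublet list: each row is (inner hit, outer hit)
doublets = np.array([[0, 1],
                     [1, 2],
                     [1, 3],
                     [2, 4]])

# Pair up doublets sharing a middle hit: (a, b) followed by (b, c)
triplets = [(i, j)                        # indices into the doublet list
            for i, (a, b) in enumerate(doublets)
            for j, (c, d) in enumerate(doublets)
            if b == c]
print(triplets)   # [(0, 1), (0, 2), (1, 3)]
```

Here doublet 0 = (0, 1) chains onto both 1 = (1, 2) and 2 = (1, 3), and 1 = (1, 2) chains onto 3 = (2, 4), which is exactly the branching a shared-hit triplet construction produces.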
# Load by directory (preferred)
result_base = os.path.expandvars('$SCRATCH/ExaTrkX/processed_sparse/results/triplets')
result_name = 'high_lowcut_005'
result_dir = os.path.join(result_base, result_name)
config = load_config_dir(result_dir)
print('Configuration:')
pprint.pprint(config)
summaries = load_summaries(config)
best_idx = summaries.valid_loss.idxmin()
print('\nTraining summaries:')
summaries
# Build the trainer and load best checkpoint
trainer = get_trainer(output_dir=config['output_dir'], gpu=0, **config['trainer'])
trainer.build_model(optimizer_config=config['optimizer'], **config['model'])
best_epoch = summaries.epoch.loc[best_idx]
trainer.load_checkpoint(checkpoint_id=best_epoch)
print(trainer.model)
print('Parameters:', sum(p.numel() for p in trainer.model.parameters()))
# Load the test dataset
n_test = 80
test_loader, filelist = get_test_data_loader(config, n_test=n_test)
%%time
# Apply the model
test_preds, test_targets = trainer.device_predict(test_loader)
i = 0
g = test_loader.dataset[i]
pid = np.load(filelist[g.i][:-4]+"_ID.npz", allow_pickle=True)["pid"]
X, e, y, o = g.x.numpy()*np.hstack([feature_scale, feature_scale, 1]), g.edge_index.numpy(), g.y.numpy(), test_preds[i].numpy()
len(pid.nonzero()[0])
%%time
draw_triplets_tf_mul_xy(X, e, o, y)
%%time
draw_triplets_xy_antiscore_cut_edges(X, e, o, y)
%%time
draw_triplets_xy_antiscore_cut(X, e, o, y)
%%time
draw_triplets_xy(X, e, o, y)
Brian should have 14 triplet components.
# Throw this away later!!!!!!!!!!!!!!!!
brian = pid[pid>0][1000]
cut=0.5
brians = []
for b in np.unique(pid[pid>0]):
    brian = b
    if len(np.where(pid[e[0,o>cut]] == brian)[0]) > 8: brians.append(brian)
len(brians)
brian = brians[0]
np.where(pid[e[0,o>0.5]]==brian)
np.where(pid[e[1,o>0.5]]==brian)
np.where(pid[e[0,o<0.5]]==brian)
np.where(pid[e[1,o<0.5]]==brian)
e_pid = pid[e[0,:]] * y
cut = 0.5
p_edges = e[:,o>cut]
n_edges = e[:,o<cut]
brian_p_edges = p_edges[:,e_pid[o>cut]==brian]
brian_n_edges = n_edges[:,e_pid[o<cut]==brian]
full_brian = np.hstack([brian_n_edges,brian_p_edges])
brian_p_edges
brian_n_edges
full_brian
plt.figure(figsize=(20,10))
xi, yi = X[:,0] * np.cos(X[:,1]), X[:,0] * np.sin(X[:,1])
xo, yo = X[:,3] * np.cos(X[:,4]), X[:,3] * np.sin(X[:,4])
for edge in full_brian.T[:]:
    plt.plot([xi[edge[0]], xi[edge[1]]], [yi[edge[0]], yi[edge[1]]], linewidth=1, alpha=0.9)
    plt.plot([xo[edge[0]], xo[edge[1]]], [yo[edge[0]], yo[edge[1]]], linewidth=1, alpha=0.9)
    plt.scatter(xi[edge], yi[edge])
    plt.scatter(xo[edge], yo[edge])
plt.figure(figsize=(20,10))
xi, yi = X[:,0] * np.cos(X[:,1]), X[:,0] * np.sin(X[:,1])
xo, yo = X[:,3] * np.cos(X[:,4]), X[:,3] * np.sin(X[:,4])
for edge in brian_p_edges.T[:]:
    plt.plot([xi[edge[0]], xi[edge[1]]], [yi[edge[0]], yi[edge[1]]], linewidth=1, alpha=0.9)
    plt.plot([xo[edge[0]], xo[edge[1]]], [yo[edge[0]], yo[edge[1]]], linewidth=1, alpha=0.9)
    plt.scatter(xi[edge], yi[edge])
    plt.scatter(xo[edge], yo[edge])
plt.figure(figsize=(20,10))
x = X[:,0] * np.cos(X[:,1])
y = X[:,0] * np.sin(X[:,1])
for edge in brian_n_edges.T[:]:
    plt.plot([x[edge[0]], x[edge[1]]], [y[edge[0]], y[edge[1]]], linewidth=1, alpha=0.9)
    plt.scatter(x[edge], y[edge])
plt.figure(figsize=(20,10))
x = X[:,0] * np.cos(X[:,1])
y = X[:,0] * np.sin(X[:,1])
plt.plot([x[brian_p_edges[0,:]], x[brian_p_edges[1,:]]], [y[brian_p_edges[0,:]], y[brian_p_edges[1,:]]], c='k', linewidth=2, alpha=0.9)
plt.plot([x[brian_n_edges[0,:]], x[brian_n_edges[1,:]]], [y[brian_n_edges[0,:]], y[brian_n_edges[1,:]]], c='r', linewidth=2, alpha=0.9)
def draw_brian_triplets_xy(hits, edges, preds, brian, pid, labels, cut=0.5, figsize=(16, 16)):
    xi, yi = hits[:,0] * np.cos(hits[:,1]), hits[:,0] * np.sin(hits[:,1])
    xo, yo = hits[:,3] * np.cos(hits[:,4]), hits[:,3] * np.sin(hits[:,4])
    fig, ax0 = plt.subplots(figsize=figsize)
    brian_filter = pid == brian
    p_brian_edges = edges[:,(brian_filter[edges[0,:]]) & (brian_filter[edges[1,:]]) & (preds > cut)]
    n_brian_edges = edges[:,(brian_filter[edges[0,:]]) & (brian_filter[edges[1,:]]) & (preds < cut)]
    # Draw the hits
    ax0.scatter(xi, yi, s=2, c='k')
    # Draw the segments
    for j in range(labels.shape[0]):
        # False negatives
        if preds[j] < cut and labels[j] > cut:
            ax0.plot([xi[edges[0,j]], xo[edges[0,j]]],
                     [yi[edges[0,j]], yo[edges[0,j]]],
                     '--', c='b', alpha=0.6)
            ax0.plot([xi[edges[1,j]], xo[edges[1,j]]],
                     [yi[edges[1,j]], yo[edges[1,j]]],
                     '--', c='b', alpha=0.6)
        # False positives
        if preds[j] > cut and labels[j] < cut:
            ax0.plot([xi[edges[0,j]], xo[edges[0,j]]],
                     [yi[edges[0,j]], yo[edges[0,j]]],
                     '-', c='r', alpha=0.6)
            ax0.plot([xi[edges[1,j]], xo[edges[1,j]]],
                     [yi[edges[1,j]], yo[edges[1,j]]],
                     '-', c='r', alpha=0.6)
        # True positives
        if preds[j] > cut and labels[j] > cut:
            ax0.plot([xi[edges[0,j]], xo[edges[0,j]]],
                     [yi[edges[0,j]], yo[edges[0,j]]],
                     '-', c='k', alpha=0.01)
            ax0.plot([xi[edges[1,j]], xo[edges[1,j]]],
                     [yi[edges[1,j]], yo[edges[1,j]]],
                     '-', c='k', alpha=0.01)
    ax0.plot([xi[p_brian_edges[0,:]], xi[p_brian_edges[1,:]]], [yi[p_brian_edges[0,:]], yi[p_brian_edges[1,:]]], c='k', linewidth=3, alpha=0.9)
    ax0.plot([xi[n_brian_edges[0,:]], xi[n_brian_edges[1,:]]], [yi[n_brian_edges[0,:]], yi[n_brian_edges[1,:]]], c='r', linewidth=3, alpha=0.9)
    ax0.plot([xo[p_brian_edges[0,:]], xo[p_brian_edges[1,:]]], [yo[p_brian_edges[0,:]], yo[p_brian_edges[1,:]]], c='k', linewidth=3, alpha=0.9)
    ax0.plot([xo[n_brian_edges[0,:]], xo[n_brian_edges[1,:]]], [yo[n_brian_edges[0,:]], yo[n_brian_edges[1,:]]], c='r', linewidth=3, alpha=0.9)
    ax0.scatter(xi[brian_filter], yi[brian_filter], c='k', s=50, alpha=0.9)
    ax0.scatter(xo[brian_filter], yo[brian_filter], c='k', s=50, alpha=0.9)
    return fig, ax0
e
o > 0.5
len(y)
%%time
n_edges = 20830
draw_brian_triplets_xy(X, e[:,:n_edges], o[:n_edges], brian, pid, y[:n_edges], cut=0.5)
Now we see that Brian (thick black) is perfectly classified, with every combination of his triplets included as an above-threshold prediction.